Part 1 - Data Exploration

Eu Jin Lok
Kernel post for Speech Accent Archive on Kaggle
2 December 2019

Understanding the data and setting an objective

In this notebook we will go into the details of how to explore audio data and converge on an objective, which will most likely involve some kind of deep learning, because it's awesome. If I do write a blog post about this, I will update this kernel. But for now, it's just the Jupyter notebook as a kernel, and my very first one!

Before we begin, I just wanted to say that my first real heavy involvement in audio was back in March 2018 whilst doing the audio competition on Kaggle. Thanks to that competition and the awesome community support, I learnt a lot, and so I wanted to contribute back to the community in the same way. So without further ado, let's begin.

In [25]:
import pandas as pd       
import os 
import math 
import numpy as np
import matplotlib.pyplot as plt  
import IPython.display as ipd  # To play sound in the notebook
import librosa
import librosa.display
os.chdir(".../input")
#os.chdir("C:\\Users\\User\\Documents\\GIT\\Kaggle-Kernel-Speech-Accent-Archive\\")
---------------------------------------------------------------------------
FileNotFoundError                         Traceback (most recent call last)
<ipython-input-25-715a209907e9> in <module>()
      7 import librosa
      8 import librosa.display
----> 9 os.chdir(".../input")
     10 #os.chdir("C:\\Users\\User\\Documents\\GIT\\Kaggle-Kernel-Speech-Accent-Archive\\")

FileNotFoundError: [WinError 3] The system cannot find the path specified: '.../input'

After loading the libraries and setting our directory path, let's check out the metadata file to see what we're dealing with.

In [24]:
#load the data 
df = pd.read_csv("speakers_all.csv", header=0)

# Check the data
print(df.shape, 'is the shape of the dataset') 
print('------------------------') 
print(df.head())
(2172, 12) is the shape of the dataset
------------------------
    age  age_onset              birthplace  filename native_language   sex  \
0  24.0       12.0         koussi, senegal   balanta         balanta  male   
1  18.0       10.0          buea, cameroon  cameroon        cameroon  male   
2  48.0        8.0  hong, adamawa, nigeria  fulfulde        fulfulde  male   
3  42.0       42.0   port-au-prince, haiti   haitian         haitian  male   
4  40.0       35.0   port-au-prince, haiti   haitian         haitian  male   

   speakerid   country  file_missing?  Unnamed: 9  Unnamed: 10 Unnamed: 11  
0        788   senegal           True         NaN          NaN         NaN  
1       1953  cameroon           True         NaN          NaN         NaN  
2       1037   nigeria           True         NaN          NaN         NaN  
3       1165     haiti           True         NaN          NaN         NaN  
4       1166     haiti           True         NaN          NaN         NaN  

I noticed that the last 3 columns of the dataset are strange empty columns. Let's clean them up and run some more stats.

In [3]:
df.drop(df.columns[9:12],axis = 1, inplace = True)
print(df.columns)
df.describe()
Index(['age', 'age_onset', 'birthplace', 'filename', 'native_language', 'sex',
       'speakerid', 'country', 'file_missing?'],
      dtype='object')
Out[3]:
               age    age_onset    speakerid
count  2172.000000  2172.000000  2172.000000
mean     33.117173     8.833333  1088.449355
std      14.453039     8.451127   628.420329
min       0.000000     0.000000     1.000000
25%      22.000000     0.000000   543.750000
50%      28.000000     8.000000  1088.500000
75%      41.000000    13.000000  1632.250000
max      97.000000    86.000000  2176.000000
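
Dropping columns by position works here, but it breaks silently if the CSV layout ever changes. A minimal sketch of dropping them by name instead, assuming the empty columns always carry the `Unnamed: N` label that pandas gives to header-less columns. The helper name `drop_unnamed_columns` and the toy frame are hypothetical, made up for illustration:

```python
import pandas as pd


def drop_unnamed_columns(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of frame without any column whose label starts with 'Unnamed:'."""
    unnamed = [col for col in frame.columns if str(col).startswith("Unnamed:")]
    return frame.drop(columns=unnamed)


# Toy frame standing in for speakers_all.csv (values are made up)
toy = pd.DataFrame({
    "age": [24.0, 18.0],
    "sex": ["male", "male"],
    "Unnamed: 9": [None, None],
    "Unnamed: 10": [None, None],
})

cleaned = drop_unnamed_columns(toy)
print(list(cleaned.columns))
```

Because the helper returns a copy rather than using `inplace=True`, the original frame stays available if you want to inspect what was dropped.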
In [4]:
# Very rough plot
df['country'].value_counts().plot(kind='bar')
Out[4]:
<matplotlib.axes._subplots.AxesSubplot at 0x2568fd9f940>
In [5]:
# Ok so that wasn't a very good idea. Let's try something else...
df['native_language'].value_counts().plot(kind='bar')
Out[5]:
<matplotlib.axes._subplots.AxesSubplot at 0x2568fe62dd8>
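
Both bar charts suffer from the same problem: far too many categories to read. One possible way around it, sketched here under my own assumptions, is to keep only the top N categories and lump the rest into a single bucket before plotting. The helper `top_n_counts` and the `"other"` label are hypothetical and not part of the dataset (if a real category were called "other" it would be overwritten):

```python
import pandas as pd


def top_n_counts(series: pd.Series, n: int = 10) -> pd.Series:
    """Count values, keep the n most frequent and sum the rest under 'other'."""
    counts = series.value_counts()
    top = counts.head(n).copy()
    remainder = int(counts.iloc[n:].sum())
    if remainder > 0:
        top["other"] = remainder
    return top


# Toy series standing in for df['native_language'] (values are made up)
toy = pd.Series(["english"] * 5 + ["spanish"] * 3 + ["arabic"] * 2 + ["zulu", "irish"])

summary = top_n_counts(toy, n=3)
print(summary)
# summary.plot(kind='barh') would then give a readable chart
```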
In [6]:
# That's lots of categories too! Ok, so maybe let's try and visualise this in a different way...
df.groupby("native_language")['age'].describe().sort_values(by=['count'],ascending=False)
Out[6]:
count mean std min 25% 50% 75% max
native_language
english 579.0 34.482729 16.734510 6.0 21.00 29.0 44.00 90.0
spanish 162.0 34.129630 13.972528 17.0 23.00 30.0 45.00 80.0
arabic 102.0 30.950980 12.047248 18.0 21.25 28.0 38.00 70.0
mandarin 65.0 30.015385 8.193924 18.0 24.00 28.0 34.00 53.0
french 63.0 33.333333 16.246091 18.0 21.00 27.0 39.00 78.0
korean 52.0 33.230769 12.718083 18.0 22.00 29.5 43.50 62.0
russian 48.0 34.145833 15.533066 18.0 24.00 28.5 37.00 84.0
portuguese 48.0 29.625000 10.487328 18.0 22.00 26.0 36.25 65.0
dutch 47.0 28.765957 11.876562 18.0 21.00 23.0 35.50 68.0
turkish 37.0 25.081081 5.889625 18.0 21.00 24.0 27.00 45.0
german 36.0 31.250000 13.601208 18.0 21.00 28.0 36.00 77.0
polish 34.0 29.764706 15.889884 18.0 20.00 23.5 32.50 97.0
italian 33.0 36.000000 14.197271 18.0 23.00 33.0 47.00 78.0
japanese 27.0 35.037037 13.633899 18.0 25.00 29.0 44.50 69.0
macedonian 26.0 20.692308 2.276299 19.0 19.25 20.0 20.75 29.0
cantonese 23.0 23.956522 4.781387 18.0 20.00 22.0 28.50 33.0
farsi 23.0 33.695652 12.564144 19.0 23.00 29.0 42.50 63.0
vietnamese 22.0 34.863636 15.489768 18.0 23.25 29.0 44.75 69.0
swedish 20.0 28.100000 10.900990 18.0 20.75 23.0 32.25 56.0
romanian 20.0 36.950000 11.550689 19.0 27.00 35.0 46.25 62.0
amharic 20.0 28.450000 8.287435 19.0 22.50 27.5 31.00 52.0
bulgarian 18.0 27.333333 9.145877 18.0 21.25 24.5 30.75 53.0
hindi 18.0 33.611111 14.345070 19.0 23.50 27.5 42.75 64.0
tagalog 18.0 41.388889 19.671236 18.0 21.00 40.5 56.25 77.0
serbian 18.0 30.333333 6.799654 22.0 25.25 28.0 33.00 47.0
bengali 17.0 32.294118 13.142321 18.0 24.00 26.0 40.00 61.0
urdu 16.0 24.187500 6.939921 19.0 20.75 22.0 24.25 48.0
thai 15.0 28.733333 12.908838 18.0 21.00 25.0 31.50 69.0
greek 15.0 35.933333 18.952824 18.0 22.50 26.0 44.00 81.0
nepali 13.0 27.384615 6.982579 19.0 23.00 25.0 36.00 39.0
... ... ... ... ... ... ... ... ...
mankanya 1.0 27.000000 NaN 27.0 27.00 27.0 27.00 27.0
mandinka 1.0 23.000000 NaN 23.0 23.00 23.0 23.00 23.0
mandingo 1.0 36.000000 NaN 36.0 36.00 36.0 36.00 36.0
lamotrekese 1.0 38.000000 NaN 38.0 38.00 38.0 38.00 38.0
agni 1.0 25.000000 NaN 25.0 25.00 25.0 25.00 25.0
lingala 1.0 37.000000 NaN 37.0 37.00 37.0 37.00 37.0
kalanga 1.0 30.000000 NaN 30.0 30.00 30.0 30.00 30.0
northern 1.0 36.000000 NaN 36.0 36.00 36.0 36.00 36.0
kabyle 1.0 26.000000 NaN 26.0 26.00 26.0 26.00 26.0
nuer 1.0 30.000000 NaN 30.0 30.00 30.0 30.00 30.0
shilluk 1.0 42.000000 NaN 42.0 42.00 42.0 42.00 42.0
malagasy 1.0 44.000000 NaN 44.0 44.00 44.0 44.00 44.0
sesotho 1.0 52.000000 NaN 52.0 52.00 52.0 52.00 52.0
serer 1.0 39.000000 NaN 39.0 39.00 39.0 39.00 39.0
sarua 1.0 21.000000 NaN 21.0 21.00 21.0 21.00 21.0
sardinian 1.0 35.000000 NaN 35.0 35.00 35.0 35.00 35.0
sa'a 1.0 52.000000 NaN 52.0 52.00 52.0 52.00 52.0
rwanda 1.0 45.000000 NaN 45.0 45.00 45.0 45.00 45.0
gedeo 1.0 51.000000 NaN 51.0 51.00 51.0 51.00 51.0
rundi 1.0 36.000000 NaN 36.0 36.00 36.0 36.00 36.0
hainanese 1.0 45.000000 NaN 45.0 45.00 45.0 45.00 45.0
hakka 1.0 72.000000 NaN 72.0 72.00 72.0 72.00 72.0
hindko 1.0 62.000000 NaN 62.0 62.00 62.0 62.00 62.0
poonchi 1.0 25.000000 NaN 25.0 25.00 25.0 25.00 25.0
pohnpeian 1.0 18.000000 NaN 18.0 18.00 18.0 18.00 18.0
ife 1.0 30.000000 NaN 30.0 30.00 30.0 30.00 30.0
ilonggo 1.0 53.000000 NaN 53.0 53.00 53.0 53.00 53.0
irish 1.0 28.000000 NaN 28.0 28.00 28.0 28.00 28.0
jola 1.0 34.000000 NaN 34.0 34.00 34.0 34.00 34.0
zulu 1.0 24.000000 NaN 24.0 24.00 24.0 24.00 24.0

214 rows × 8 columns

In [7]:
# Check country of origin again...
df.groupby("country")['age'].describe().sort_values(by=['count'],ascending=False)
Out[7]:
count mean std min 25% 50% 75% max
country
usa 393.0 35.652672 18.044364 6.0 21.00 29.0 47.00 90.0
china 88.0 29.477273 9.654940 18.0 23.00 27.0 33.00 72.0
uk 67.0 33.104478 14.900818 18.0 20.00 30.0 38.00 71.0
india 59.0 30.864407 11.501810 18.0 22.00 28.0 35.00 64.0
canada 54.0 31.629630 14.713998 18.0 21.25 26.5 37.50 78.0
south korea 51.0 33.392157 12.786052 18.0 22.50 30.0 45.00 62.0
brazil 39.0 28.410256 8.881435 18.0 21.50 26.0 34.50 54.0
belgium 36.0 28.416667 15.648596 18.0 21.00 22.0 23.00 78.0
turkey 35.0 25.200000 5.959668 18.0 22.00 24.0 27.00 45.0
poland 34.0 29.764706 15.889884 18.0 20.00 23.5 32.50 97.0
australia 33.0 29.090909 10.156167 18.0 22.00 26.0 32.00 60.0
saudi arabia 33.0 27.696970 10.187262 18.0 22.00 25.0 29.00 57.0
germany 32.0 32.687500 14.578901 18.0 21.00 27.0 41.75 77.0
italy 32.0 35.406250 14.577622 18.0 23.00 32.0 47.25 78.0
russia 31.0 32.612903 15.233029 18.0 23.00 26.0 35.00 68.0
ethiopia 31.0 33.096774 11.317700 19.0 24.50 31.0 37.00 58.0
france 28.0 31.892857 14.476902 19.0 20.75 27.5 39.00 67.0
japan 26.0 35.423077 13.752594 18.0 25.00 29.5 44.75 69.0
macedonia 26.0 20.692308 2.276299 19.0 19.25 20.0 20.75 29.0
nigeria 23.0 43.347826 10.559824 23.0 35.50 46.0 48.00 65.0
philippines 23.0 39.608696 18.612652 18.0 21.00 39.0 50.50 77.0
iran 22.0 33.045455 11.688323 19.0 24.75 29.0 41.25 63.0
iraq 22.0 38.363636 12.045728 20.0 30.50 35.5 46.00 62.0
spain 22.0 36.136364 14.567725 18.0 24.25 33.0 44.75 69.0
vietnam 22.0 34.863636 15.489768 18.0 23.25 29.0 44.75 69.0
colombia 22.0 36.727273 17.501515 17.0 25.75 31.5 45.25 80.0
nicaragua 21.0 29.095238 13.928046 18.0 18.00 20.0 40.00 58.0
pakistan 21.0 26.904762 10.985922 18.0 21.00 23.0 26.00 62.0
taiwan 20.0 38.400000 14.228955 23.0 27.75 32.5 47.00 72.0
bulgaria 19.0 27.105263 8.943618 18.0 21.50 24.0 30.50 53.0
... ... ... ... ... ... ... ... ...
namibia 1.0 44.000000 NaN 44.0 44.00 44.0 44.00 44.0
barbados 1.0 20.000000 NaN 20.0 20.00 20.0 20.00 20.0
bahrain 1.0 20.000000 NaN 20.0 20.00 20.0 20.00 20.0
niger 1.0 43.000000 NaN 43.0 43.00 43.0 43.00 43.0
antigua and barbuda 1.0 28.000000 NaN 28.0 28.00 28.0 28.00 28.0
andorra 1.0 24.000000 NaN 24.0 24.00 24.0 24.00 24.0
virginia 1.0 29.000000 NaN 29.0 29.00 29.0 29.00 29.0
yemen 1.0 21.000000 NaN 21.0 21.00 21.0 21.00 21.0
yugoslavia 1.0 36.000000 NaN 36.0 36.00 36.0 36.00 36.0
zambia 1.0 25.000000 NaN 25.0 25.00 25.0 25.00 25.0
montenegro 1.0 26.000000 NaN 26.0 26.00 26.0 26.00 26.0
martinique 1.0 21.000000 NaN 21.0 21.00 21.0 21.00 21.0
trinidad 1.0 31.000000 NaN 31.0 31.00 31.0 31.00 31.0
israel (occupied territory) 1.0 38.000000 NaN 38.0 38.00 38.0 38.00 38.0
sicily 1.0 77.000000 NaN 77.0 77.00 77.0 77.00 77.0
rwanda 1.0 45.000000 NaN 45.0 45.00 45.0 45.00 45.0
guatemala 1.0 50.000000 NaN 50.0 50.00 50.0 50.00 50.0
gabon 1.0 26.000000 NaN 26.0 26.00 26.0 26.00 26.0
faroe islands 1.0 28.000000 NaN 28.0 28.00 28.0 28.00 28.0
oman 1.0 32.000000 NaN 32.0 32.00 32.0 32.00 32.0
lesotho 1.0 52.000000 NaN 52.0 52.00 52.0 52.00 52.0
romanian 1.0 46.000000 NaN 46.0 46.00 46.0 46.00 46.0
liechtenstein 1.0 54.000000 NaN 54.0 54.00 54.0 54.00 54.0
luxembourg 1.0 54.000000 NaN 54.0 54.00 54.0 54.00 54.0
madagascar 1.0 44.000000 NaN 44.0 44.00 44.0 44.00 44.0
chad 1.0 21.000000 NaN 21.0 21.00 21.0 21.00 21.0
burundi 1.0 36.000000 NaN 36.0 36.00 36.0 36.00 36.0
the bahamas 1.0 19.000000 NaN 19.0 19.00 19.0 19.00 19.0
malawi 1.0 42.000000 NaN 42.0 42.00 42.0 42.00 42.0
equatorial guinea 1.0 20.000000 NaN 20.0 20.00 20.0 20.00 20.0

176 rows × 8 columns


There are more native languages than there are countries (214 versus 176), which I suppose makes sense, although why that is remains an untested hypothesis. A Sankey-type plot would be interesting here, but let's park it for now as a separate task. Right now, let's continue on with our main objective...
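
As a starting point for that parked task, here is a minimal sketch of the table a Sankey or network plot would need: one row per (native language, country) pair with a speaker count. The function name `language_country_links`, the `speakers` column, and the toy data are my own assumptions for illustration:

```python
import pandas as pd


def language_country_links(frame: pd.DataFrame) -> pd.DataFrame:
    """Count speakers for every (native_language, country) pair, largest first."""
    return (
        frame.groupby(["native_language", "country"])
        .size()
        .reset_index(name="speakers")
        .sort_values("speakers", ascending=False)
        .reset_index(drop=True)
    )


# Toy frame standing in for df (values are made up)
toy = pd.DataFrame({
    "native_language": ["english", "english", "english", "spanish", "spanish"],
    "country": ["usa", "usa", "uk", "spain", "colombia"],
})

links = language_country_links(toy)
print(toy["native_language"].nunique(), "languages,", toy["country"].nunique(), "countries")
print(links)
```

Each row of `links` corresponds to one flow (source, target, width) in a Sankey diagram, so the plotting step can be added later without touching this part.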

In [8]:
# Age statistics by sex
df.groupby("sex")['age'].describe()
Out[8]:
         count       mean        std   min   25%   50%   75%   max
sex
famale     1.0  65.000000        NaN  65.0  65.0  65.0  65.0  65.0
female  1048.0  34.072519  15.337869   0.0  22.0  29.0  43.0  89.0
male    1123.0  32.197240  13.492936   0.0  22.0  28.0  39.0  97.0

Hmmm... 'famale' must be a typo. Let's notify @Rachel Tatman about this observation. But for now, let's continue on.
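
For anyone who wants to patch the label locally in the meantime, a minimal sketch, assuming 'famale' really is a misspelling of 'female'. The toy frame is made up, and `fixed_sex` is just an illustrative name:

```python
import pandas as pd

# Toy frame standing in for df (values are made up)
toy = pd.DataFrame({"sex": ["male", "female", "famale", "female"]})

# Map the misspelled label onto the intended one; all other values are left untouched
fixed_sex = toy["sex"].replace({"famale": "female"})
print(fixed_sex.value_counts())
```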

In [9]:
# birthplace
df.groupby("birthplace")['age'].describe().sort_values(by=['count'],ascending=False)
Out[9]:
count mean std min 25% 50% 75% max
birthplace
seoul, south korea 25.0 32.040000 12.300000 18.0 23.00 27.0 40.00 62.0
skopje, macedonia 21.0 20.047619 1.160870 19.0 19.00 20.0 20.00 24.0
hong kong, china 19.0 23.473684 4.753577 18.0 19.50 22.0 27.00 33.0
addis ababa, ethiopia 16.0 28.875000 8.663140 20.0 22.50 27.5 31.00 52.0
bogota, colombia 14.0 36.785714 20.881271 17.0 21.00 30.5 48.25 80.0
jiddah, saudi arabia 14.0 28.214286 10.445305 18.0 21.25 24.0 34.25 56.0
bilwi, puerto cabezas, nicaragua 13.0 24.769231 9.166550 18.0 18.00 20.0 32.00 41.0
tehran, iran 13.0 33.153846 12.999014 19.0 22.00 29.0 43.00 63.0
accra, ghana 13.0 38.230769 13.292566 20.0 25.00 38.0 46.00 60.0
istanbul, turkey 12.0 23.333333 4.478907 18.0 18.75 23.5 25.25 31.0
sao paulo, brazil 12.0 32.916667 10.799481 18.0 24.50 32.0 40.75 54.0
baghdad, iraq 12.0 35.500000 11.571910 20.0 27.75 34.5 41.25 62.0
bucharest, romania 10.0 37.200000 11.554701 25.0 28.50 34.5 44.00 62.0
moscow, russia 10.0 34.500000 18.608540 20.0 23.25 25.5 41.00 68.0
lima, peru 9.0 30.333333 12.884099 20.0 21.00 27.0 30.00 55.0
riyadh, saudi arabia 9.0 28.111111 12.149531 19.0 22.00 23.0 26.00 57.0
kabul, afghanistan 9.0 41.666667 15.676415 20.0 32.00 42.0 56.00 59.0
la paz, bolivia 9.0 40.111111 15.152924 22.0 25.00 48.0 53.00 58.0
freetown, sierra leone 9.0 32.444444 9.976361 21.0 26.00 32.0 36.00 48.0
belgrade, serbia 8.0 28.250000 4.496030 22.0 25.00 27.5 31.50 35.0
antwerp, belgium 8.0 22.125000 0.991031 21.0 21.00 22.5 23.00 23.0
sofia, bulgaria 8.0 33.000000 10.836446 23.0 26.00 30.5 34.75 53.0
sydney, australia 8.0 33.625000 10.225144 24.0 27.25 30.0 36.75 53.0
brooklyn, new york, usa 8.0 54.875000 21.963850 25.0 43.25 52.0 60.75 90.0
bangkok, thailand 8.0 27.875000 8.078852 18.0 21.25 27.5 34.00 39.0
manila, philippines 8.0 51.500000 20.500871 18.0 38.00 53.5 64.00 77.0
tokyo, japan 8.0 30.875000 11.331845 20.0 24.00 26.0 37.00 53.0
paris, france 8.0 30.000000 16.248077 20.0 20.75 22.5 30.75 67.0
washington, district of columbia, usa 8.0 30.750000 18.195368 6.0 18.75 28.5 38.00 58.0
beirut, lebanon 7.0 40.285714 17.036376 19.0 28.50 40.0 48.00 70.0
... ... ... ... ... ... ... ... ...
ibadan, oyo, nigeria 1.0 40.000000 NaN 40.0 40.00 40.0 40.00 40.0
ibadan, nigeria 1.0 47.000000 NaN 47.0 47.00 47.0 47.00 47.0
hyderabad, andhra pradesh, india 1.0 38.000000 NaN 38.0 38.00 38.0 38.00 38.0
huron, south dakota, usa 1.0 57.000000 NaN 57.0 57.00 57.0 57.00 57.0
huhot, nei meng gu, china 1.0 32.000000 NaN 32.0 32.00 32.0 32.00 32.0
janow, poland 1.0 26.000000 NaN 26.0 26.00 26.0 26.00 26.0
hue, vietnam 1.0 29.000000 NaN 29.0 29.00 29.0 29.00 29.0
hudson, new york, usa 1.0 48.000000 NaN 48.0 48.00 48.0 48.00 48.0
hucknall, nottinghamshire, england, uk 1.0 35.000000 NaN 35.0 35.00 35.0 35.00 35.0
hubballi, karnataka, india 1.0 64.000000 NaN 64.0 64.00 64.0 64.00 64.0
huanuco, provincia dos de mayo, peru 1.0 32.000000 NaN 32.0 32.00 32.0 32.00 32.0
hsinchu, taiwan 1.0 25.000000 NaN 25.0 25.00 25.0 25.00 25.0
ilbague, colombia 1.0 57.000000 NaN 57.0 57.00 57.0 57.00 57.0
imo state, imo state, nigeria 1.0 23.000000 NaN 23.0 23.00 23.0 23.00 23.0
imphal, india 1.0 19.000000 NaN 19.0 19.00 19.0 19.00 19.0
ingolstadt, germany 1.0 51.000000 NaN 51.0 51.00 51.0 51.00 51.0
innsbruck, austria 1.0 35.000000 NaN 35.0 35.00 35.0 35.00 35.0
ioannina, greece 1.0 37.000000 NaN 37.0 37.00 37.0 37.00 37.0
iowa city, iowa, usa 1.0 22.000000 NaN 22.0 22.00 22.0 22.00 22.0
iquitos, peru 1.0 30.000000 NaN 30.0 30.00 30.0 30.00 30.0
irbid, jordan 1.0 55.000000 NaN 55.0 55.00 55.0 55.00 55.0
irvine, scotland, uk 1.0 35.000000 NaN 35.0 35.00 35.0 35.00 35.0
isle of arran, scotland, uk 1.0 20.000000 NaN 20.0 20.00 20.0 20.00 20.0
izmail, ukraine 1.0 50.000000 NaN 50.0 50.00 50.0 50.00 50.0
jaipur, india 1.0 43.000000 NaN 43.0 43.00 43.0 43.00 43.0
jalandhar, india 1.0 42.000000 NaN 42.0 42.00 42.0 42.00 42.0
jalisco, mexico 1.0 45.000000 NaN 45.0 45.00 45.0 45.00 45.0
jammu, kashmir, india 1.0 25.000000 NaN 25.0 25.00 25.0 25.00 25.0
jamshedpur, jharkhand, india 1.0 29.000000 NaN 29.0 29.00 29.0 29.00 29.0
lewisville, texas, usa 1.0 21.000000 NaN 21.0 21.00 21.0 21.00 21.0

1290 rows × 8 columns
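
With this many distinct birthplaces, one quick check is whether the last comma-separated part of each birthplace string agrees with the country column. A minimal sketch, assuming birthplaces follow the "city, ..., country" pattern seen above. The helper `birthplace_matches_country` is hypothetical, and the toy rows (including the deliberate mismatch) are made up:

```python
import pandas as pd


def birthplace_matches_country(frame: pd.DataFrame) -> pd.Series:
    """True where the last comma-separated part of birthplace equals country."""
    last_part = frame["birthplace"].str.split(",").str[-1].str.strip()
    return last_part == frame["country"]


# Toy frame standing in for df; the third row is a deliberate, made-up mismatch
toy = pd.DataFrame({
    "birthplace": ["seoul, south korea", "hong kong, china", "example town, examplestan"],
    "country": ["south korea", "china", "elsewhere"],
})

matches = birthplace_matches_country(toy)
print(toy[~matches])
```

Rows with a missing birthplace compare as `False`, so they show up in the mismatch list and can be inspected by hand.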


Birthplace is a very sparse data point, with 1290 unique categories and very few observations in each one. Again, it could be interesting to see how Birthplace and Country relate, either through a network analysis or a Sankey plot. That may shed some light on whether the high number of Seoul birthplace observations lines up with country, i.e. could they be South Koreans living elsewhere, such as China or the USA? And for the last one...

In [10]:
# file_missing
df.groupby("file_missing?")['age'].describe().sort_values(by=['count'],ascending=False)
Out[10]:
                count       mean        std   min   25%   50%   75%   max
file_missing?
False          2140.0  33.080607  14.444245   0.0  22.0  28.0  41.0  97.0
True             32.0  35.562500  15.063173  18.0  24.0  35.0  42.0  73.0

32 missing files. What does this actually mean? I read the overview page and there's no mention of this. So, let's go see it for ourselves...

In [11]:
# Count the total audio files given
print (len([name for name in os.listdir('recordings\\') if os.path.isfile(os.path.join('recordings\\', name))]))
2138

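Huh? The metadata flags 2140 files as present but we only count 2138 recordings, so we have 2 missing audio files. Well, I suppose the one sure way to tell is if we did a join between the filenames in the metadata and the audio files in the folder.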
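
One way to compare the metadata against the audio folder is sketched below, under a few assumptions of mine: recordings are named `<filename>.mp3`, and `file_missing?` can be read as a boolean. The helper `compare_with_disk` is hypothetical, and the demo runs on a temporary folder with made-up files rather than the real `recordings` directory:

```python
import os
import tempfile

import pandas as pd


def compare_with_disk(frame: pd.DataFrame, audio_dir: str):
    """Return (listed as present but absent on disk, on disk but not listed as present)."""
    on_disk = {
        os.path.splitext(name)[0]
        for name in os.listdir(audio_dir)
        if os.path.isfile(os.path.join(audio_dir, name))
    }
    present = ~frame["file_missing?"].astype(bool)
    expected = set(frame.loc[present, "filename"])
    return sorted(expected - on_disk), sorted(on_disk - expected)


# Toy metadata standing in for df (values are made up)
toy = pd.DataFrame({
    "filename": ["afrikaans1", "mandarin46", "agni1", "haitian"],
    "file_missing?": [False, False, False, True],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    # Create two empty stand-in recordings; agni1.mp3 is left out on purpose
    for stem in ["afrikaans1", "mandarin46"]:
        open(os.path.join(tmp_dir, stem + ".mp3"), "w").close()
    absent, unlisted = compare_with_disk(toy, tmp_dir)

print("Listed as present but not on disk:", absent)
print("On disk but not listed as present:", unlisted)
```

On the real data, the first list would be where any unaccounted-for recordings turn up.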

In [12]:
# The filename column. This time we just print out the first 10 records.
df.groupby("filename")['age'].describe().sort_values(by=['count'],ascending=False).head(10)
Out[12]:
            count       mean        std   min    25%   50%    75%   max
filename
haitian       6.0  36.333333  13.952300  18.0  25.75  41.0  42.75  54.0
swiss         5.0  30.200000   8.318654  21.0  24.00  30.0  34.00  42.0
nicaragua     4.0  37.000000  12.569805  20.0  32.75  39.0  43.25  50.0
jamaican      3.0  36.333333  28.307832  19.0  20.00  21.0  45.00  69.0
liberian      2.0  33.000000   7.071068  28.0  30.50  33.0  35.50  38.0
hawai'i       2.0  71.000000   2.828427  69.0  70.00  71.0  72.00  73.0
afrikaans1    1.0  27.000000        NaN  27.0  27.00  27.0  27.00  27.0
mandarin46    1.0  43.000000        NaN  43.0  43.00  43.0  43.00  43.0
mandarin42    1.0  47.000000        NaN  47.0  47.00  47.0  47.00  47.0
mandarin43    1.0  24.000000        NaN  24.0  24.00  24.0  24.00  24.0

Wait, there are some files that share the same filename. On closer inspection, I suspect these filenames also have missing audio files, in which case it is ok. So, let's have a look at the final column (not exactly the final one, but skipping SpeakerID should be excusable, right?)

In [13]:
# The file_missing? column. Again, just print the first 10 records
df.groupby("filename")['file_missing?'].describe().sort_values(by=['count'],ascending=False).head(10)
# pd.crosstab(df['filename'],df['file_missing?']) as an alternative method 
Out[13]:
            count  unique    top  freq
filename
haitian         6       1   True     6
swiss           5       1   True     5
nicaragua       4       2   True     3
jamaican        3       1   True     3
liberian        2       1   True     2
hawai'i         2       1   True     2
afrikaans1      1       1  False     1
mandarin46      1       1  False     1
mandarin42      1       1  False     1
mandarin43      1       1  False     1

The duplicated filenames almost all have missing audio files (the one exception is 'nicaragua', where one of the 4 rows is not flagged as missing). Close enough, so everything checks out. We can go ahead and read in the audio files, and listen in to a few. We'll look at 'afrikaans1' and 'mandarin46' since they're in our peripheral vision.
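
Rather than eyeballing the first 10 rows, the duplicate check can be made explicit. A minimal sketch, assuming `file_missing?` can be read as a boolean. The helper `duplicate_filename_report` and its column names are hypothetical, and the toy rows are made up (the 'nicaragua' pattern of 3 missing and 1 present mirrors the table above):

```python
import pandas as pd


def duplicate_filename_report(frame: pd.DataFrame) -> pd.DataFrame:
    """For each filename used more than once, count its rows and how many are flagged missing."""
    dupes = frame[frame.duplicated("filename", keep=False)]
    flags = dupes["file_missing?"].astype(int)
    grouped = flags.groupby(dupes["filename"])
    report = pd.DataFrame({"rows": grouped.size(), "missing": grouped.sum()})
    report["all_missing"] = report["rows"] == report["missing"]
    return report


# Toy frame standing in for df (values are made up)
toy = pd.DataFrame({
    "filename": ["haitian", "haitian", "nicaragua", "nicaragua", "nicaragua", "afrikaans1"],
    "file_missing?": [True, True, True, True, False, False],
})

report = duplicate_filename_report(toy)
print(report)
```

Any filename where `all_missing` is `False` deserves a second look before it is used as a key to load audio.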

In [15]:
# Play afrikaans
fname1 = 'recordings\\' + 'afrikaans1.mp3'
ipd.Audio(fname1)
Out[15]:
In [16]:
# Play mandarin46
fname2 = 'recordings\\' + 'mandarin46.mp3'
ipd.Audio(fname2)
Out[16]:
In [17]:
# Let's have a listen to a male voice.
print(df.groupby("filename")['sex'].describe().head(10))
fname3 = 'recordings\\' + 'agni1.mp3'   
ipd.Audio(fname3)
           count unique     top freq
filename                            
afrikaans1     1      1  female    1
afrikaans2     1      1    male    1
afrikaans3     1      1    male    1
afrikaans4     1      1    male    1
afrikaans5     1      1    male    1
agni1          1      1    male    1
akan1          1      1    male    1
albanian1      1      1    male    1
albanian2      1      1    male    1
albanian3      1      1    male    1
Out[17]:

Ok, so we've come to a point where we need to make a decision. There are a few objectives worth pursuing off the top of my head, and they are:

  • building a gender predictor from voice
  • building an accent predictor by Country from voice
  • building an accent predictor by Birthplace from voice

We could build all 3 applications above, starting with the easiest, the gender predictor. The gender predictor will serve as our prototype, and once we've built it, we'll expand to Country, followed by Birthplace. I'm not even sure if Birthplace is viable, but let's re-evaluate when we circle back to this. For now, let's run with Gender first. Also note that we don't have to limit ourselves to supervised modelling. There's much more we can do:

  • Audio fingerprinting
  • Emotion analysis (Text and Voice)
  • Speed, inflection etc etc
  • Others

There's a lot you can do with audio, but we'll look at these at a later stage. In the meantime, the show must go on. So let's stick to our simple objective and run a few more examples of male and female audio files. This time, I want to hear the US Southern accent, 'cause I've always liked that accent and find it fascinating.

In [18]:
print(df[df['birthplace'].str.contains("kentucky",na=False)])
fname4 = 'recordings\\' + 'english385.mp3'   
ipd.Audio(fname4)
      age  age_onset                   birthplace    filename native_language  \
420  31.0        0.0   brownsville, kentucky, usa  english150         english   
463  53.0        0.0    louisville, kentucky, usa   english19         english   
667  22.0        0.0  russellville, kentucky, usa  english373         english   
676  85.0        0.0   pike county, kentucky, usa  english381         english   
680  77.0        0.0       mcveigh, kentucky, usa  english385         english   
766  20.0        0.0       paducah, kentucky, usa  english462         english   

        sex  speakerid country  file_missing?  
420    male        527     usa          False  
463    male         76     usa          False  
667    male       1308     usa          False  
676    male       1324     usa          False  
680  female       1328     usa          False  
766    male       1564     usa          False  
Out[18]:
In [19]:
fname5 = 'recordings\\' + 'english462.mp3'   
ipd.Audio(fname5)
Out[19]:

The male version doesn't have a strong Southern accent, and there's some distortion of the audio at the start. That could pose a problem for our accent predictor by Birthplace, but it's nothing to worry about for Gender. Looking at the data, it seems like there's some potential age correlation here. So let's hear one final one!

In [20]:
fname6 = 'recordings\\' + 'english381.mp3'   # An older male 
ipd.Audio(fname6)
Out[20]:

Ok, so we'll go ahead with our first mini-objective, which is to create a gender predictor, with the ultimate objective being an accent predictor. The next logical step is to analyse the audio files themselves and extract features from them, which we'll do in Part 2 of this series. In the meantime, I'll leave you with the wave plots of the 3 Kentucky accents. Can you tell the difference / similarity?

In [22]:
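# One figure with three stacked panels, one per speaker.
# Note: librosa.display.waveplot was removed in librosa 0.10;
# on newer versions use librosa.display.waveshow instead.
plt.figure(figsize=(10, 8))
plt.subplots_adjust(hspace=0.6)  # leave room for the panel titles

# Older female
y, sr = librosa.load(fname4)
plt.subplot(3, 1, 1)
librosa.display.waveplot(y, sr=sr)
plt.title('older female')

# Older male
y, sr = librosa.load(fname6)
plt.subplot(3, 1, 2)
librosa.display.waveplot(y, sr=sr)
plt.title('older male')

# Younger male
y, sr = librosa.load(fname5)
plt.subplot(3, 1, 3)
librosa.display.waveplot(y, sr=sr)
plt.title('younger male')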
Out[22]:
Text(0.5,1,'younger male')
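
As a small head start on the gender predictor, here is a minimal sketch of turning the metadata into training labels. Everything in it is an assumption on my part rather than the method Part 2 will use: rows flagged as missing are dropped, the 'famale' typo is mapped to 'female', and the 0/1 encoding is arbitrary. The helper `gender_labels` and the toy data are hypothetical:

```python
import pandas as pd


def gender_labels(frame: pd.DataFrame) -> pd.DataFrame:
    """Keep rows with an audio file and a usable sex value, and add a 0/1 label."""
    usable = frame[~frame["file_missing?"].astype(bool)].copy()
    usable["sex"] = usable["sex"].replace({"famale": "female"})
    usable = usable[usable["sex"].isin(["female", "male"])].copy()
    # Arbitrary encoding chosen for illustration: female -> 0, male -> 1
    usable["label"] = usable["sex"].map({"female": 0, "male": 1})
    return usable[["filename", "sex", "label"]].reset_index(drop=True)


# Toy frame standing in for df (values are made up)
toy = pd.DataFrame({
    "filename": ["afrikaans1", "afrikaans2", "example1", "haitian"],
    "sex": ["female", "male", "famale", "male"],
    "file_missing?": [False, False, False, True],
})

labels = gender_labels(toy)
print(labels)
```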